Building Compact Lexicons for Cross-Domain SMT by Mining Near-Optimal Pattern Sets
Abstract
Statistical machine translation (SMT) models are known to benefit from a domain-specific bilingual lexicon. Such lexicons traditionally consist of multiword expressions, either extracted from parallel corpora or manually curated. We claim that “patterns”, composed of words and higher-order categories, generalize better in capturing the syntax and semantics of the domain. In this work, we present an approach to extract such patterns from a domain corpus and curate a high-quality bilingual lexicon. We discuss several features of these patterns that define the “consensus” between their underlying multiword expressions. We incorporate the bilingual lexicon into a baseline SMT model, and detailed experiments show that the resulting translation model substantially outperforms both the baseline and other similar systems.
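To make the notion of a pattern concrete, the following is a minimal sketch of how several multiword expressions might collapse into one pattern of words and categories. It is not the authors' extraction method: the category map `CATEGORIES`, the `<NUM>` and `<DRUG>` labels, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical word-to-category map (an assumption, not from the paper).
# In practice it might come from a POS tagger, an entity recognizer,
# or a domain ontology.
CATEGORIES = {
    "aspirin": "<DRUG>",
    "ibuprofen": "<DRUG>",
    "two": "<NUM>",
    "three": "<NUM>",
}


def to_pattern(multiword, categories=CATEGORIES):
    """Replace every token that has a known category with that category."""
    return tuple(categories.get(tok, tok) for tok in multiword.lower().split())


def group_by_pattern(multiwords, categories=CATEGORIES):
    """Map each pattern to the multiword expressions it covers."""
    groups = defaultdict(list)
    for mw in multiwords:
        groups[to_pattern(mw, categories)].append(mw)
    return dict(groups)


def matches(pattern, phrase, categories=CATEGORIES):
    """True if each pattern slot equals the token or the token's category."""
    tokens = phrase.lower().split()
    if len(tokens) != len(pattern):
        return False
    return all(p == t or p == categories.get(t) for p, t in zip(pattern, tokens))


groups = group_by_pattern([
    "two tablets of aspirin",
    "three tablets of ibuprofen",
    "take with food",
])
```

Here the first two expressions fall under the single pattern `("<NUM>", "tablets", "of", "<DRUG>")`, which also matches unseen phrases such as "three tablets of aspirin". This is the sense in which a pattern can generalize beyond the multiword expressions it was built from.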
Related resources
Building a Bilingual Lexicon Using Phrase-based Statistical Machine Translation via a Pivot Language
This paper proposes a novel method for building a bilingual lexicon through a pivot language by using phrase-based statistical machine translation (SMT). Given two bilingual lexicons for the language pairs Lf–Lp and Lp–Le, we treat these lexicons as parallel corpora. Then, we merge the two extracted phrase tables into one phrase table between Lf and Le. Finally, we construct a phrase-based SMT...
Statistical Machine Translation without Parallel Data
We examine approaches to statistical machine translation (SMT) without parallel data. SMT has achieved impressive performance by leveraging large amounts of parallel data in the source and target languages. But such data is available only for a few language pairs and domains. Using human annotation to create new parallel corpora sufficient for building a good translation system is too expensive...
Contrast Pattern Mining and Its Application for Building Robust Classifiers
The ability to distinguish, differentiate and contrast between different data sets is a key objective in data mining. Such an ability can help domain experts understand their data and can help in building classification models. This presentation will introduce techniques for contrasting data sets. It will also focus on some important real-world applications that illustrate how contrast pa...
Large SMT data-sets extracted from Wikipedia
The article presents experiments on mining Wikipedia to extract SMT-useful sentence pairs in three language pairs. Each extracted sentence pair is associated with a cross-lingual lexical similarity score, based on which several evaluations have been conducted to estimate the similarity thresholds that allow the extraction of the most useful data for training three-language pairs SMT system...
Evaluation of Context-Dependent Phrasal Translation Lexicons for Statistical Machine Translation
We present new direct data analysis showing that dynamically-built context-dependent phrasal translation lexicons are more useful resources for phrase-based statistical machine translation (SMT) than conventional static phrasal translation lexicons, which ignore all contextual information. After several years of surprising negative results, recent work suggests that context-dependent phrasal tr...